
Scientific Text


SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts

Brinner, Marc, Zarriess, Sina

arXiv.org Artificial Intelligence

We introduce SemCSE, an unsupervised method for learning semantic embeddings of scientific texts. Building on recent advances in contrastive learning for text embeddings, our approach leverages LLM-generated summaries of scientific abstracts to train a model that positions semantically related summaries closer together in the embedding space. The resulting objective ensures that the model captures the true semantic content of a text, in contrast to traditional citation-based approaches that do not necessarily reflect semantic similarity. To validate this, we propose a novel benchmark designed to assess a model's ability to understand and encode the semantic content of scientific texts, demonstrating that our method enforces a stronger semantic separation within the embedding space. Additionally, we evaluate SemCSE on the comprehensive SciRepEval benchmark for scientific text embeddings, where it achieves state-of-the-art performance among models of its size, highlighting the benefits of a semantically focused training approach.
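The abstract describes a contrastive objective over embeddings of LLM-generated summaries but gives no implementation details. As a minimal sketch only, the standard InfoNCE-style loss such approaches typically build on can be written with plain NumPy; the arrays here stand in for model outputs, and the temperature value is an invented placeholder, not SemCSE's actual setting:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.05):
    """InfoNCE-style contrastive loss: for each anchor embedding, the matching
    row in `positives` (e.g. the embedding of another summary of the same
    abstract) is the positive; all other rows act as in-batch negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                 # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))       # matching pairs on the diagonal
```

Minimizing this loss pulls each anchor toward its paired summary and pushes it away from the other summaries in the batch, which is the "semantic separation" effect the abstract refers to.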


THM@SimpleText 2025 -- Task 1.1: Revisiting Text Simplification based on Complex Terms for Non-Experts

Hofmann, Nico, Dauenhauer, Julian, Dietzler, Nils Ole, Idahor, Idehen Daniel, Kreutz, Christin Katharina

arXiv.org Artificial Intelligence

Scientific text is complex, as it contains technical terms by definition. Simplifying such text for non-domain experts enhances the accessibility of innovation and information. Politicians could be enabled to understand new findings on topics on which they intend to pass a law, or family members of seriously ill patients could read about clinical trials. The SimpleText CLEF Lab focuses on exactly this problem of simplifying scientific text. Task 1.1 of the 2025 edition specifically handles the simplification of complex sentences, i.e., very short texts with little context. To tackle this task, we investigate identifying complex terms in sentences, which are then rephrased for non-expert readers using small Gemini and OpenAI large language models.
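The abstract does not specify how complex terms are identified before rephrasing. As a hedged illustration of the general pattern (detect candidate terms, then build an LLM rephrasing prompt around them), the sketch below uses an invented length-and-frequency heuristic; the word list, threshold, and prompt wording are all placeholders, not the paper's method:

```python
import re

# Tiny stand-in for a frequency list of everyday words (illustrative only).
COMMON_WORDS = {
    "the", "a", "an", "of", "in", "and", "to", "for", "is", "are",
    "with", "on", "that", "this", "by", "from", "as", "it", "be",
}

def find_complex_terms(sentence, common_words=COMMON_WORDS, min_length=9):
    """Flag long tokens absent from the common-word list as complex."""
    tokens = re.findall(r"[A-Za-z-]+", sentence)
    return [t for t in tokens if t.lower() not in common_words and len(t) >= min_length]

def build_simplification_prompt(sentence, terms):
    """Assemble a rephrasing prompt around the flagged terms."""
    term_list = ", ".join(terms) if terms else "none found"
    return (
        "Rewrite the following sentence for a non-expert reader, "
        f"explaining or replacing these complex terms ({term_list}):\n{sentence}"
    )
```

The resulting prompt string would then be sent to a small Gemini or OpenAI model; the API call itself is omitted here.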


Four Shades of Life Sciences: A Dataset for Disinformation Detection in the Life Sciences

Seidlmayer, Eva, Galke, Lukas, Förstner, Konrad U.

arXiv.org Artificial Intelligence

Disseminators of disinformation often seek to attract attention or evoke emotions - typically to gain influence or generate revenue - resulting in distinctive rhetorical patterns that can be exploited by machine learning models. In this study, we explore linguistic and rhetorical features as proxies for distinguishing disinformative texts from other health and life-science text genres, applying both large language models and classical machine learning classifiers. Given the limitations of existing datasets, which mainly focus on fact-checking misinformation, we introduce Four Shades of Life Sciences (FSoLS): a novel, labeled corpus of 2,603 texts on 14 life-science topics, retrieved from 17 diverse sources and classified into four categories of life science publications. The source code for replicating and updating the dataset is available on GitHub: https://github.com/EvaSeidlmayer/FourShadesofLifeSciences
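The abstract names linguistic and rhetorical features as classification proxies without listing them. The sketch below invents a few toy surface features (punctuation rates, sentence length, a made-up emotive word list) purely to illustrate the kind of signal such classifiers can use; it is not the paper's feature set:

```python
import re

# Invented mini-lexicon of emotionally charged words, for illustration only.
EMOTIVE = {"shocking", "miracle", "terrifying", "amazing", "deadly", "secret"}

def rhetorical_features(text):
    """Compute simple surface features that often differ between sober
    scientific prose and attention-seeking disinformation."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_sent = max(len(sentences), 1)
    n_words = max(len(words), 1)
    return {
        "exclamations_per_sentence": text.count("!") / n_sent,
        "questions_per_sentence": text.count("?") / n_sent,
        "avg_sentence_length": len(words) / n_sent,
        "emotive_word_ratio": sum(w in EMOTIVE for w in words) / n_words,
    }
```

Feature dictionaries like these could feed a classical classifier (e.g. logistic regression) over the four FSoLS categories.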


Comparative Analysis of OpenAI GPT-4o and DeepSeek R1 for Scientific Text Categorization Using Prompt Engineering

Maiti, Aniruddha, Adewumi, Samuel, Tikure, Temesgen Alemayehu, Wang, Zichun, Sengupta, Niladri, Sukhanova, Anastasiia, Jana, Ananya

arXiv.org Artificial Intelligence

This study examines how large language models categorize sentences from scientific papers using prompt engineering. We use two advanced web-based models, GPT-4o (by OpenAI) and DeepSeek R1, to classify sentences into predefined relationship categories. DeepSeek R1 has been tested on benchmark datasets in its technical report. However, its performance in scientific text categorization remains unexplored. To address this gap, we introduce a new evaluation method designed specifically for this task. We also compile a dataset of cleaned scientific papers from diverse domains. This dataset provides a platform for comparing the two models. Using this dataset, we analyze their effectiveness and consistency in categorization.
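The abstract describes prompt-engineered sentence classification but does not reproduce the prompt or the label set. As a sketch under stated assumptions, the category names, prompt wording, and lenient output parsing below are all invented for illustration:

```python
# Hypothetical relationship categories; the paper's actual label set is not
# reproduced here.
CATEGORIES = ["background", "method", "result", "comparison"]

def classification_prompt(sentence, categories=CATEGORIES):
    """Build a constrained single-label classification prompt for an LLM."""
    options = ", ".join(categories)
    return (
        f"Classify the sentence into exactly one of: {options}.\n"
        f"Sentence: {sentence}\nAnswer with the category name only."
    )

def parse_label(model_output, categories=CATEGORIES):
    """Map a free-form model reply back onto the label set; None if no match."""
    reply = model_output.strip().lower()
    for category in categories:
        if category in reply:
            return category
    return None
```

Tolerant parsing of this kind matters when comparing two web-based models, since each may format its answer differently.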


On the Effectiveness of Large Language Models in Automating Categorization of Scientific Texts

Shahi, Gautam Kishore, Hummel, Oliver

arXiv.org Artificial Intelligence

The amount of scholarly texts is consistently increasing; around 2.5 million research articles are published yearly (Rabby et al., 2024). Due to this enormous increase, the classification of (scientific) texts has been attracting even more attention in recent years (Bornmann et al., 2021). Classifying the research area of scientific texts requires significant domain knowledge in various complex research fields. Hence, manual classification is challenging and time-consuming for librarians and limits the number of texts that can be classified manually (Zhang et al., 2023). Moreover, due to complex hierarchical classification schemes and their existing variety, classification of publications is also an unpopular activity for researchers. Prominent examples of classification schemes include the Open Research Knowledge Graph (ORKG) (Auer and Mann, 2019), Microsoft Academic Graph (Wang et al., 2020), the Semantic Scholar Academic Graph (Kinney et al., 2023), the ACM Computing Classification System (Rous, 2012), Dewey Decimal Classification (DDC) (Scott, 1998), and the ACL Anthology (Bird et al., 2008).


ByteScience: Bridging Unstructured Scientific Literature and Structured Data with Auto Fine-tuned Large Language Model in Token Granularity

Xie, Tong, Zhang, Hanzhi, Wang, Shaozhou, Wan, Yuwei, Razzak, Imran, Kit, Chunyu, Zhang, Wenjie, Hoex, Bram

arXiv.org Artificial Intelligence

Natural Language Processing (NLP) is widely used to distill long, unstructured text into structured information. However, extracting structured knowledge from scientific text with NLP models remains challenging because of the domain-specific nature of such text, complex data preprocessing requirements, and the granularity of multi-layered device-level information. To address this, we introduce ByteScience, a non-profit cloud-based auto fine-tuned Large Language Model (LLM) platform, which is designed to extract structured scientific data and synthesize new scientific knowledge from vast scientific corpora. The platform capitalizes on DARWIN, an open-source, fine-tuned LLM dedicated to natural science. The platform was built on Amazon Web Services (AWS) and provides an automated, user-friendly workflow for custom model development and data extraction. The platform achieves remarkable accuracy with only a small number of well-annotated articles. This innovative tool streamlines the transition from scientific literature to structured knowledge and data, and benefits advances in natural informatics.


Fine-Tuning Large Language Models for Scientific Text Classification: A Comparative Study

Rostam, Zhyar Rzgar K, Kertész, Gábor

arXiv.org Artificial Intelligence

The exponential growth of online textual content across diverse domains has necessitated advanced methods for automated text classification. Large Language Models (LLMs) based on transformer architectures have shown significant success in this area, particularly in natural language processing (NLP) tasks. However, general-purpose LLMs often struggle with domain-specific content, such as scientific texts, due to unique challenges like specialized vocabulary and imbalanced data. In this study, we fine-tune four state-of-the-art LLMs (BERT, SciBERT, BioBERT, and BlueBERT) on three datasets derived from the WoS-46985 dataset to evaluate their performance in scientific text classification. Our experiments reveal that domain-specific models, particularly SciBERT, consistently outperform general-purpose models in both abstract-based and keyword-based classification tasks. Additionally, we compare our achieved results with those reported in the literature for deep learning models, further highlighting the advantages of LLMs, especially when utilized in specific domains. The findings emphasize the importance of domain-specific adaptations for LLMs to enhance their effectiveness in specialized text classification tasks.
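Fine-tuning BERT-style models for classification amounts to training a softmax head on top of the encoder's pooled output (and usually updating the encoder too). As a rough NumPy sketch of just the head's training step, with random feature vectors standing in for encoder outputs, none of which reflects the paper's actual setup:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_classifier_head(features, labels, n_classes, lr=0.5, steps=300, seed=0):
    """Train a linear softmax head with plain gradient descent; in real
    fine-tuning this head sits on top of the transformer's pooled output."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(features.shape[1], n_classes))
    b = np.zeros(n_classes)
    n = len(labels)
    for _ in range(steps):
        probs = softmax(features @ W + b)
        grad = probs.copy()
        grad[np.arange(n), labels] -= 1.0      # dL/dlogits for cross-entropy
        grad /= n
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(features, W, b):
    return np.argmax(features @ W + b, axis=1)
```

Domain-specific encoders like SciBERT matter precisely because the quality of the `features` this head receives determines how separable the classes are.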


Steering AI-Driven Personalization of Scientific Text for General Audiences

Kim, Taewook, Agarwal, Dhruv, Ackerman, Jordan, Saha, Manaswi

arXiv.org Artificial Intelligence

Digital media platforms (e.g., social media, science blogs) offer opportunities to communicate scientific content to general audiences at scale. However, these audiences vary in their scientific expertise, literacy levels, and personal backgrounds, making effective science communication challenging. To address this challenge, we designed TranSlider, an AI-powered tool that generates personalized translations of scientific text based on individual user profiles (e.g., hobbies, location, and education). Our tool features an interactive slider that allows users to steer the degree of personalization from 0 (weakly relatable) to 100 (strongly relatable), leveraging LLMs to generate translations at the chosen degree. Through an exploratory study with 15 participants, we investigated both the utility of these AI-personalized translations and how interactive reading features influenced users' understanding and reading experiences. We found that participants who preferred higher degrees of personalization appreciated the relatable and contextual translations, while those who preferred lower degrees valued concise translations with subtle contextualization. Furthermore, participants reported the compounding effect of multiple translations on their understanding of scientific content. Given these findings, we discuss several implications of AI-personalized translation tools in facilitating communication in collaborative contexts.
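One way a 0-100 slider can steer an LLM is by mapping the degree onto prompt instructions of increasing personalization strength. The band boundaries and wording below are invented for this sketch and are not TranSlider's actual prompting scheme:

```python
def personalization_prompt(text, profile, degree):
    """Map a 0-100 personalization slider onto prompt instructions;
    the three bands and their phrasing are illustrative placeholders."""
    if not 0 <= degree <= 100:
        raise ValueError("degree must be between 0 and 100")
    if degree < 34:
        style = "keep the explanation concise, with only subtle nods to the reader's background"
    elif degree < 67:
        style = "weave in occasional analogies drawn from the reader's background"
    else:
        style = "ground the whole explanation in analogies from the reader's background"
    return (
        f"Reader profile: {profile}\n"
        f"Rewrite the scientific text below for this reader; {style}.\n"
        f"Text: {text}"
    )
```

A continuous alternative would interpolate instruction strength rather than using discrete bands; the study's finding that some readers prefer "subtle contextualization" maps to the low end of either design.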


KALE-LM: Unleash The Power Of AI For Science Via Knowledge And Logic Enhanced Large Model

Dai, Weichen, Chen, Yezeng, Dai, Zijie, Huang, Zhijie, Liu, Yubo, Pan, Yixuan, Song, Baiyang, Zhong, Chengli, Li, Xinhe, Wang, Zeyu, Feng, Zhuoying, Zhou, Yi

arXiv.org Artificial Intelligence

In recent years, the rapid development of artificial intelligence (AI) technology has enabled it to achieve, and in some cases surpass, top human performance in various high-intelligence tasks. These include speech [1], face [2], and image [3] recognition; games such as Go [4], StarCraft [5], and Dota2 [6]; as well as tasks related to text [7], image [8], and video generation, machine translation [9], knowledge-based question answering [10], debates, and solving advanced mathematical problems [11]. Science is one of the most important fields for the application of AI. As the crown jewel of human civilization and the cornerstone of various industries, science is a core driver of human progress, and its development can significantly accelerate and even revolutionize many fields. Historically, there have been three major research paradigms in science: the first paradigm, experiment, which emerged from Newtonian empiricism; the second paradigm, theory, born from Einstein's rationalism; and the third paradigm, simulation/computation, which arose from the third industrial revolution, the computation and information revolution.


Towards understanding evolution of science through language model series

Dong, Junjie, Lyu, Zhuoqi, Ke, Qing

arXiv.org Artificial Intelligence

We introduce AnnualBERT, a series of language models designed specifically to capture the temporal evolution of scientific text. Deviating from the prevailing paradigms of subword tokenization and "one model to rule them all", AnnualBERT adopts whole words as tokens and is composed of a base RoBERTa model pretrained from scratch on the full text of 1.7 million arXiv papers published until 2008 and a collection of models progressively trained on arXiv papers on an annual basis. We demonstrate the effectiveness of AnnualBERT models by showing that they not only achieve comparable performance on standard tasks but also state-of-the-art performance on domain-specific NLP tasks as well as link prediction tasks in the arXiv citation network. We then use probing tasks to quantify the models' behavior in terms of representation learning and forgetting as time progresses. Our approach enables the pretrained models not only to improve performance on scientific text processing tasks but also to provide insights into the development of scientific discourse over time. The series of models is available at https://huggingface.co/jd445/AnnualBERTs.
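The distinction between whole-word and subword tokenization is easy to show in miniature. In a whole-word scheme like the one the abstract describes, an out-of-vocabulary word maps to a single unknown token instead of being split into subword pieces; the vocabulary and regex below are invented for the sketch:

```python
import re

def whole_word_tokenize(text, vocab, unk="[UNK]"):
    """Whole-word tokenization: keep each word as a single token and map
    out-of-vocabulary words to [UNK] instead of splitting them into subwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t if t in vocab else unk for t in tokens]
```

Keeping words whole makes per-token probes about scientific terminology straightforward (one word, one representation), at the cost of a larger vocabulary and more unknowns than a subword scheme would produce.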